
    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.
    Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
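    The probabilistic combination of prosodic and lexical knowledge sources described above can be sketched as a log-linear interpolation of per-boundary posteriors. This is an illustrative sketch, not the paper's trained model: the interpolation weight and both posterior lists are assumed values.

```python
import math

def combine_boundary_posteriors(p_prosody, p_lexical, weight=0.5):
    """Log-linear interpolation of per-boundary posteriors from a
    prosodic classifier and a word-based language model.
    `weight` balances the two knowledge sources (hypothetical value)."""
    combined = []
    for pp, pl in zip(p_prosody, p_lexical):
        # interpolate in the log domain, then renormalize over {boundary, no-boundary}
        log_yes = weight * math.log(pp) + (1 - weight) * math.log(pl)
        log_no = weight * math.log(1 - pp) + (1 - weight) * math.log(1 - pl)
        m = max(log_yes, log_no)
        yes, no = math.exp(log_yes - m), math.exp(log_no - m)
        combined.append(yes / (yes + no))
    return combined

# example: prosodic model is confident at the first boundary, LM is not
scores = combine_boundary_posteriors([0.9, 0.1], [0.6, 0.5])
```

    In practice the weight would be tuned on held-out data; the log-domain form keeps a confident model from being washed out by an uncertain one.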

    Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation

    We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov models and decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using the DARPA-TDT evaluation metrics. Results show that the prosodic model alone is competitive with word-based segmentation methods. Furthermore, we achieve a significant reduction in error by combining the prosodic and word-based knowledge sources.
    Comment: 27 pages, 8 figures

    The case for automatic higher-level features in forensic speaker recognition

    Approaches from standard automatic speaker recognition, which rely on cepstral features, suffer from a lack of interpretability in forensic applications. But the growing practice of using "higher-level" features in automatic systems offers promise in this regard. We provide an overview of automatic higher-level systems and discuss potential advantages, as well as open issues, for their use in the forensic context.

    Identifying Agreement and Disagreement in Conversational Speech: Use of Bayesian Networks to Model Pragmatic Dependencies

    We describe a statistical approach for modeling agreements and disagreements in conversational interaction. Our approach first identifies adjacency pairs using maximum entropy ranking based on a set of lexical, durational, and structural features that look both forward and backward in the discourse. We then classify utterances as agreement or disagreement using these adjacency pairs and features that represent various pragmatic influences of previous agreement or disagreement on the current utterance. Our approach achieves 86.9% accuracy, a 4.9% increase over previous work.
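    The maximum entropy ranking step above can be sketched as a linear model scoring candidate antecedent utterances, with a softmax over the candidates. The feature names and weights here are illustrative assumptions, not the paper's actual feature set.

```python
import math

def rank_adjacency_candidates(candidates, weights):
    """Rank candidate antecedent utterances for an adjacency pair using a
    maximum-entropy-style linear model. Missing features default to 0."""
    def score(feats):
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())
    scores = [score(c["features"]) for c in candidates]
    # softmax over candidates gives a distribution used for ranking
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(candidates)), key=lambda i: -probs[i])
    return order, probs

# illustrative weights: nearby, overlapping utterances by another speaker rank higher
weights = {"same_speaker": -1.2, "distance": -0.8, "overlap": 0.9}
candidates = [
    {"id": "utt1", "features": {"distance": 1.0, "overlap": 1.0}},
    {"id": "utt2", "features": {"distance": 3.0, "same_speaker": 1.0}},
]
order, probs = rank_adjacency_candidates(candidates, weights)
```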

    Pauses in Deceptive Speech

    We use a corpus of spontaneous interview speech to investigate the relationship between the distributional and prosodic characteristics of silent and filled pauses and the intent of an interviewee to deceive an interviewer. Our data suggest that the use of pauses correlates more with truthful than with deceptive speech, and that prosodic features extracted from filled pauses themselves, as well as features describing contextual prosodic information in the vicinity of filled pauses, may facilitate the detection of deceit in speech.
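    Distributional pause features of the kind studied above can be computed from word-level time alignments. A minimal sketch, assuming (start, end) timestamps in seconds; the 200 ms silence threshold and the feature names are illustrative choices, not the paper's.

```python
def pause_features(word_intervals, min_pause=0.2):
    """Distributional pause statistics from word time alignments.
    A gap between consecutive words counts as a silent pause if it
    meets `min_pause` (an assumed threshold)."""
    pauses = []
    for (s1, e1), (s2, e2) in zip(word_intervals, word_intervals[1:]):
        gap = s2 - e1
        if gap >= min_pause:
            pauses.append(gap)
    speech_time = sum(e - s for s, e in word_intervals)
    return {
        "pause_count": len(pauses),
        "mean_pause_dur": sum(pauses) / len(pauses) if pauses else 0.0,
        "pause_rate": len(pauses) / speech_time if speech_time else 0.0,
    }

# three words; only the 0.5 s gap counts as a silent pause
feats = pause_features([(0.0, 0.4), (0.9, 1.3), (1.35, 1.8)])
```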

    Addressee detection for dialog systems using temporal and spectral dimensions of speaking style

    As dialog systems evolve to handle unconstrained input and to operate in open environments, addressee detection (detecting speech directed to the system versus to other people) becomes an increasingly important challenge. We study a corpus in which speakers talk both to a system and to each other, and model two dimensions of speaking style that talkers modify when changing addressee: speech rhythm and vocal effort. For each dimension we design features that require no speech recognition output, session normalization, speaker normalization, or dialog context. Detection experiments show that rhythm and effort features are complementary, outperform lexical models based on recognized words, and reduce error rates even when word recognition is error-free. Simulated online processing experiments show that all features need only the first couple of seconds of speech. Finally, we find that temporal and spectral stylistic models can be trained on outside corpora, such as ATIS and ICSI meetings, with reasonable generalization to the target task, thus showing promise for domain-independent computer-versus-human addressee detectors.
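    A spectral cue to vocal effort of the kind modeled above can be approximated by spectral tilt: louder, system-directed speech tends to carry relatively more high-frequency energy. A rough sketch, not the paper's feature set; the 1 kHz band edge, frame length, and windowing are all assumed values.

```python
import numpy as np

def spectral_tilt(frame, sr=16000, split_hz=1000):
    """Crude vocal-effort proxy: dB difference between energy below and
    above `split_hz` in one windowed frame. Positive = low-band dominated."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    low = spec[freqs < split_hz].sum()
    high = spec[freqs >= split_hz].sum()
    return 10.0 * np.log10((low + 1e-12) / (high + 1e-12))

# synthetic check: a 200 Hz tone is low-band dominated, a 3 kHz tone is not
sr = 16000
t = np.arange(1024) / sr
low_tone = np.sin(2 * np.pi * 200 * t)
high_tone = np.sin(2 * np.pi * 3000 * t)
```

    Real systems would track such frame-level measures over time and normalize them, but even this crude statistic separates the two synthetic signals.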

    Combining Prosodic, Lexical and Cepstral Systems for Deceptive Speech Detection

    We report on machine learning experiments to distinguish deceptive from non-deceptive speech in the Columbia-SRI-Colorado (CSC) corpus. Specifically, we propose a system combination approach using different models and features for deception detection. Scores from an SVM system based on prosodic/lexical features are combined with scores from a Gaussian mixture model system based on acoustic features, resulting in improved accuracy over the individual systems. Finally, we compare results from the prosodic-only SVM system using features derived either from recognized words or from human transcriptions.
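    Score-level combination of the SVM and GMM systems can be sketched as z-normalizing each system's scores and taking a weighted sum. The equal weighting here is an illustrative assumption; in practice the weight would be tuned on held-out data.

```python
def fuse_scores(svm_scores, gmm_scores, w=0.5):
    """Score-level fusion of two detectors: z-normalize each system's
    scores so they are on a comparable scale, then take a weighted sum."""
    def znorm(xs):
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs)
        std = var ** 0.5 or 1.0  # guard against zero variance
        return [(x - mean) / std for x in xs]
    a, b = znorm(svm_scores), znorm(gmm_scores)
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

# both systems score the second trial highest, so fusion preserves that
fused = fuse_scores([0.2, 0.8, 0.5], [1.5, 3.0, 2.0])
```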

    Detecting Inappropriate Clarification Requests in Spoken Dialogue Systems

    Spoken Dialogue Systems ask for clarification when they think they have misunderstood users. Such requests may differ depending on the information the system believes it needs to clarify. However, when the error type or location is misidentified, clarification requests appear confusing or inappropriate. We describe a classifier that identifies inappropriate requests, trained on features extracted from user responses in laboratory studies. This classifier achieves 88.5% accuracy and .885 F-measure in detecting such requests.

    System combination using auxiliary information for speaker verification

    Recent studies in speaker recognition have shown that score-level combination of subsystems can yield significant performance gains over individual subsystems. We explore the use of auxiliary information to aid the combination procedure. We propose a modified linear logistic regression procedure that conditions combination weights on the auxiliary information. A regularization procedure is used to control the complexity of the extended model. Several auxiliary features are explored. Results are presented for data from the 2006 NIST speaker recognition evaluation (SRE). When an estimated degree of nonnativeness for the speaker is used as auxiliary information, the proposed combination results in a 15% relative reduction in equal error rate over methods based on standard linear logistic regression, support vector machines, and neural networks.
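    Conditioning combination weights on auxiliary information, as described above, can be sketched by making each subsystem weight an affine function of the auxiliary variable: w_i(a) = w_i + v_i * a. All weight values below are illustrative, not trained parameters from the paper.

```python
import math

def fused_llr(scores, aux, w, v, bias=0.0):
    """Auxiliary-conditioned linear logistic regression fusion: each
    subsystem weight is w[i] + v[i] * aux, so the auxiliary variable
    (e.g., an estimated degree of nonnativeness) modulates the mix."""
    z = bias + sum((w[i] + v[i] * aux) * s for i, s in enumerate(scores))
    return 1.0 / (1.0 + math.exp(-z))  # posterior for the target class

# same two subsystem scores; aux = 0 vs 1 shifts which subsystem dominates
p_aux0 = fused_llr([2.0, -1.0], aux=0.0, w=[1.0, 0.5], v=[-0.5, 0.8])
p_aux1 = fused_llr([2.0, -1.0], aux=1.0, w=[1.0, 0.5], v=[-0.5, 0.8])
```

    With v = 0 this reduces to standard linear logistic regression fusion, which is the baseline the paper compares against.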